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ABSTRACT 
DNA‘s genetic code can be represented as an alphabetic sequence composed of the 
four letters A, C, G, and T, which represent the four types of nucleotides - 
adenylic, cytidylic, guanylic, and thymidylic acid - of which DNA is composed. 
Now that these sequences have been identified for many genes and are available 
in computer-readable form, scientists can analyze these data and search for pat- 
terns in an attempt to learn more about the regulatory functions of the gene. 
One area of study is that of the frequency of occurrence of specific nucleotide 
subsequences (e.9., ACAC) within part or all of a nucleotide sequence. This 
paper derives the probability distribution"“of the frequency of occurrence of a 
subsequence within a nucleotide sequence, under the hypothesis that the four 
nucleotides occur at random and with equal probability. This distribution is 
nontrivial because different subsequences have different "overlap capability." 
For example, the subsequence AAAA can occur up to 17 times in a sequence of 
length 20 (which would happen if the sequence were composed solely of A’s), but 
the subsequence ACGT cannot occur more than 5 times in a sequence of length 20. 
Thus, the frequency distributions are different for each type of overlap 
Capability. It is of interest to assess and compare the degree of nonrandomness 
for different subsequences or among different portions of a sequence; the ex- 
istence and degree of nonrandomness may be related to the type and degree of 
functionality of a nucleotide (sub)sequence. Using the frequency distributions 
provided here, exact significance tests of the hypothesis of randomness can be 
performed. An approximate test is also described for use with long sequences; 
this can be used to test a more general null hypothesis of nucleotides occurring 


with unequal probabilities. 


1. INTRODUCTION. 
Genes are long, double-stranded, helical molecules of DNA. Each of the two 


strands contains a sequence of nucleotides - typically between 1,500 and 15,000 


of them - and the two strands are loosely bound together by hydrogen bonds. ) 
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nucleotida in DNA is identified according to which of four nitrogenous bases it 
contains: a purine base, adenine (A) or guanine (G), or a pyrimidine base, 
cytosine (C) or thymine (T). There is a one-to-one correspondence between 
nucleotides on opposite strands; an A, G, C, or T.on one strand is weakly bonded 
to its complement, T, C, G, or A, respectively, on the other. DNA‘s so-called 
"genetic code" can thus be represented as a single alphabetic sequence composed 
of these four letters. It is by means of this code that the gene controls the 
formation of other substances in the cell (see, e.g., Guyton (19469)). Also, 
certain sequences of nucleotides form oncogenic genes which can initiate some 
forms of cancer. 

Relatively recent advances in biochemistry have made it possible to deter- 
mine the nucleotide sequences for large numbers of genes, and for intergenic 
material, which also consists of nucleotide sequences. Such data are now 
available in computer-readable form, so it is possible to look for and analyze 
patterns within sequences using statistical and statistical computing tech- 
niques. Scientists are now able to use pattern recognition algorithms to "learn 
more about the regulatory nature of the various genetic functional domains, and 
more about what it is that is recognized within those domains by the cellular 
hardware..." (Sadler, Waterman, and Smith (1983)). Weir (1985) provides a use- 
ful survey of new problems of statistical analysis which have arisen following 
advances in molecular genetics. 

One area of study is that of the frequency of occurrence of specific short 
nucleotide subsequences (e.g., ACAC) within part or all of a nucleotide se- . 
quence. Maizel et al. (1981) concluded that, among the computer programs being 
widely used for nucleic acid analysis, "Most frequently used are programs that 
search for occurrences of short subsequences that are used by enzymes as signals 
to recognize, modify, and express nucleic acids, that determine the frequency 
and locations of short strings of nucleotides, and that translate nucleic acid 
sequences into amino acid sequences or complementary polynucleotide strands." 
Some examples of papers which deal with the analysis of subsequence frequencies 
are: Aquadro and Greenberg (1983), Gentleman et al. (1984), Grantham et al. 
(1981), Harr, Haggstrom, and Gustafsson (1983), Korn and Queen (1984), Nussinov 
(1984), Sadler et al. (1983), Smith and Burks (1983), Smith, Waterman, and 
Sadler (1983), Queen and Korn (1980), and Vass and Wilson (1984). 

It is of interest to assess and compare the degree of nonrandomness for 
different subsequences or among different portions of a sequence; the existence 


and degree of nonrandomness may be related to the type and degree of func- 
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tionality of a nucleotide (sub)sequence. Orgel and Crick (1980) classified the 
DNA of higher organisms as falling into two classes, one specific and the other 
comparatively nonspecific. Regarding the latter, they noted that "there is a 
large amount of evidence which suggests, but does not prove, that much DNA in 
higher organisms is little better than junk." Vass and Wilson (1984) describe 
statistical tests for detecting nonrandom arrangements on a nucleotide strand. 
These tests "may be especially useful in the analysis of patterns in DNA se- 
quences which may partly reflect the structure and function of the genes in 
which they are part." Shukla and Srivastava (1985) developed a test for sequence 
randomness based on the frequency of occurrence of a particular subsequence at 
two positions which are a fixed number of bases apart. They reasoned that "low 
probability of chance occurrence calls for further exloration...of...possible 
structural or functional significance, ...(whereas, if there is) a high 
probability of chance occurrence, then one has to exercise some caution before 
attaching any structural or functional role to that kind of...repeat." 

Gentleman et al. (1984) gave two examples of how the occurrence of a subse- 
quence in unexpectedly large numbers may provide information about structure or 
functions (1) This phenomenon may reflect the fact that that segment of DNA was 
Originally formed by replication of smaller segments. Such replication might be 
exact initially, but might be altered with time. But in regions of the sequence 
where retention of function is necessary, exact repeats would be expected to oc- 
cur. (2) A break in DNA usually occurs at different positions, frome ns uitow [0 
nucleotides apart, on the two strands. When this happens, a gap is created on 
each strand opposite remaining nucleotides on the other strand. On each strand, 
to fill this gap, nucleotides complementary to their counterparts on the other 
strand are added, thus duplicating the subsequence on the other side of the 
break. The break itself is filled in. The occurrence of repeated subsequences 
may therefore identify locations where an insert has been introduced into the 
DNA sequence. 

Relatively little is know about the specific functionality of DNA. The at- 
tempt to identify of nucleotide (sub) sequences of greater or lesser randomness 
is based in part on the concept that a higher degree of functionality may be in- 
dicated by a lower degree of randomness. The complementarity of the two strands 
of DNA permits each strand to act as a template on cell division to form two 
identical double stranded structures. As DNA reproduces itself, chance muta- 
tions may occur, so that DNA is subject to the forces of natural selection and 


evolution. Thus, a particular configuration in DNA that exists now may have 
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been favored by natural selection. (For further discussion of genetic evolu- 
tion, see Doolittle and Sapienza (1980), Orgel and Crick (1980), and Forbes and 
Shadbolt-Forbes (1988).) 

This paper derives the probability distribution of the frequency of occur- 
rence of a subsequence within a nucleotide sequence, under the hypothesis that 
the four nucleotides occur at random and with equal probability. A general al- 
gorithm is provided for calculating this probability distribution, which depends 
on the sequence length, the subsequence length, and a property of the subse- 
quence which will be termed "overlap capability". Explicit formulas are given 
for all subsequences of length 2-8. Access to these distributions permits the 
use of exact significance tests of the hypothesis of randomness. An approximate 
test is also provided for use when the sequence is long. The observed sig- 
nificance level of such tests measures the extent of the data’s departure from 
the hypothesis, i.e., the degree of nonrandomness. 

A null hypothesis of equiprobable occurrence of the different nucleotides 
is reasonable in the context of the present DNA structures having evolved from a 
"primordial soup" or "base pool" containing equal quantities of each base. This 
is discussed by Sege and Saxberg (1982), who provide a statistical test for the 
Simultaneous comparison of several nucleotide subsequences. Their “null 
hypothesis which one seeks to reject” is that the observed data came by chance 
selection from a base pool with specified relative frequencies of A, C, G, and 
T. They describe three alternatives for choosing the four null probabilities: 

"(1) The abstract nucleotide pool is unlimited and therefore the 
distribution of nucleotides is effectively equal; (2) the se- 
quences are drawn from a pool comprised of the nucleotide 
distribution typical of the species; or (3) the nucleotide pool 
for the class of sequences examined is well represented by the 
total distribution of nucleotides in the sequences themselves." 
Sege and Saxberg then discuss the conditions under which each choice is ap- 
propriate: 
"The virtual pool selected will be a function of the question 
posed by the experimenter and the level of information desired. 
Clearly the most readily interpreted virtual pools are the even 
virtual pool (frequencies = 0.25) and the ‘species/organism’ 
virtual pool (average frequencies of bases for the 
species/organism). These virtual pools should serve as_  stan- 


dards unless the investigator has sufficient reason to warrant 
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another type (e.g., experimental virtual pool). When a non- 

standard virtual pool is used, its justification and the meaning 

of the resulting (significance levels) must be carefully con- 

sidered." be 
The exact distributions provided in this paper can be used to test the first of 
Sege and Saxberg’s alternatives, and the approximate test can be used to test 
any of the three alternatives. Ea 

A model in which the four types of nucleotides occur independently has been 
assumed by some researchers (e.g., those cited in Biggins and Cannings (1987, p. 
321)), and hypothesized by others. In the latter case, the hypothesis has been 
accepted in numerous situations, particularly in the analysis of relatively 
short (sub)sequences. Garden (1980) fitted Markov chain models to three DNA/RNA 
sequences, finding that Markov models of order three, two, and zero fitted best. 
(In RNA, the pyrimidine uracil appears instead of thymine.) The zeroth-order 
model fitted a gene of length 14632. Fuchs (1980) speculated that the length of 
the sequence is directly related to the order, citing Garden's further results 
for subsequences to support this. Fuchs noted that "the majority of the 
500-nucleotide segments were fitted well by a model of order zero or one, as ex- 
pected for short sequences." He recommended two types of supplementary 
analyses: detection of anomalous regions in a sequence, and analysis of devia- 
tions between the observed and expected frequencies of nucleotide subsequences. 
Section 2 below defines the concept of "overlap capability", a property of 

a subsequence which complicates the probability function for the frequency of 
occurrence of the subsequence. Section 3 derives the expectation and variance 
of this random variable. The probability function - which is different for each 
type of overlap capability - is derived in Section 4 (with further details in 
the Appendix). Section 5 provides examples, using a human genome sequence, of 


the use of the probability function in exact and approximate significance tests 


of randomness. 


2. DEFINITION OF OVERLAP CAPABILITY. 

Assume that the four nucleotides which make up a subsequence occur indepen- 
dently and with equal probability, so that the probability p of the occurrence 
of a subsequence of length L is (Valen Let the random variable X be the 
frequency of occurrence of a nucleotide subsequence of length L within a 
nucleotide sequence of length M. A subsequence "occurs at position i" if it is 
found to begin at position i. Then n = M-L+i is the maximum value achievable by 


X. Let £(xs3l,M,Q@) be the probability function of X. This depends not only on 
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the scalars L and M, but also on the vector Q, which represents the "overlap 
capability" of the specific subsequence. As a simple example, the subsequence 
AAAR can occur between 0 and 17 times within a sequence of length 20. The 
subsequence ACAC cannot occur more than 9 times in a sequence of length 20 
because it has less overlap capability. The subsequence ACGT has no overlap 
capability except in the trivial case when it is superimposed on itself, so it 
Cannot occur more than 5 times. Thus, #(x3;L,M,Q) cannot be treated as a bino- 
mial distribution involving independent trials. 

Define overlap capability as follows: Let S be a given subsequence and Sys 
Sgyeecyo, be letters representing its L nucleotides from left to right. Then 
represent the overlap capability Q of S as a binary sequence Q,, Q,,...,Q, such 
that Q,=1 if it is possible for the subsequence’s first i letters to overlap its 
last i letters, and Q;=0 otherwise (for i=1,...,L). Specifically, iat 
Sy Skates for k=1,...,i, and Q@-=0 otherwise (for i=1,...,L). Obviously, Q,=1 
because a subsequence can always overlap its entire self. For example, the 
subsequence ACAC has overlap capability 0,1,0,1, and the subsequence AAAC has 
Overlap capability 90,9,0,1. Clearly, many subsequences can have the same over- 
lap capability. On the other hand, not all at possible binary sequences of 
length L yield possible Q’s, due to interrelationships among the elements of Q; 
for example, no subsequence can have overlap capability 1,0,1,1 (because Q,=1 
implies that all elements of S are the same, but @,30 implies the existence of 
some inequalities among them). For L=2 to L=8, for example, there are, respec- 
tively, only 2, 3, 4, 6, 8, 10, and 13 possible overlap capabilities. An al- 
gorithm that can be used to test a binary sequence to determine if it is a 
possible overlap capability is given in Guibas and Odlyzko (1981). An algorithm 
to generate all possible overlap capabilties given L is described in Gentleman 
and Mullin (1987). 

The above model can be described in the terminology of Markov chains. 
Feller (1950, p. 376, Problem 1) descibed a special case of this situation (for 
L2=2 and a two-letter alphabet) as a four-state, first-order Markov process. 
That approach generalizes here to an a’-state, (L-1)-order Markov process (where 
a is the number of letters in the genetic alphabet). Then in particular, the 
transition probability of the occurrence of S, given that S occurred L-k posi- 
tions before, is athe for k=1,...,L-1 (where pa(i/a)"), (This i8 easily 
further generalized to the case of letters having unequal probabilities.) 

Overlap capability enters into the discussion by Biggins and Cannings 


(1987) of restriction enzymes which cut DNA sequences whenever certain specific 
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subsequences occur. If one of these subsequences overlaps itself or another of 
the recognizable subsequences, the cut occurs only at the site of the "earlier" 
occurrence. For other examples of analyses incorporating the concept of overlap 


Capability, see Shukla and Srivastava (1985) and Karlin and Ost (1987). 
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3. DERIVATION OF EXPECTATION AND VARIANCE 


The expectation of X is, perhaps counterintuitively, independent of the 
subsequence's overlap capability (Q). The variance depends on Q and, as would 
be expected, is larger for subsequences with a "higher degree" of overlap 
Capability. For example, if the sequence length M=20 and the subsequence 
length L=4, then for the subsequences AAAA, ACAC, and ACGT, E(X)=.07 and V(X) 
LSe ies and, UOs respectively. 


To derive the expectation and variance of X, assume that M2L, and define 


indicator variables Y Y aun (where n=M-L+1) such that Y.=1 if 
i 


ie: ye 
the subsequence S occurs at position i, and ee otherwise (for is? ..8,n)s 


n 
Then X = > Y., so that 
rice 


n 
E(X) >? E(Y.). Since E(Y.) = (yt : DP, E(X) = np, independent of 
i= 
Q, and 
n n n n n n 
oe. poe ov ane eee EY ab ge Ey) oe EY) 
fT ja {=i jot izt i 
n n 
= E(Y.Y.) - nape Gly. 
eed : 


To obtain V(X), it is necessary to take account of the fact that the Y's 
are not in general independent of each other; covariances between Y's which 
are "near neighbors" depend on Q. To determine these covariances, quantities 
of the form SSA beat must be calculated. This is just the probability 
that the subsequence occurs at both position i and position i+tk. If 


; L+k k 
O£kSmin(L-1,n-1), then E(Yatat ig = Q (1/4) = pQ, _, (1/4) : 


2 


If min(L-1,n-1)<k<n, Q is irrelevant and EW tee ye=3G/4)-- = pe 


k 
n 


n 
There are n terms among the ie in = e. E(y ¥ .) Suchathnat, is.). 
Aah eG 
Also, if n>1 there are 2(n-k) terms such that j=it+k or i=j+k (for 
n-L 
Klpeoe a Minul=1 ni) ocmel fens cet Nere ares: k = (n-L)(n-L+1) remaining 


terms which do not depend on Q. Thus, 
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(L-1, 
SD ely.) MN een) daca k 
ECY..¥)2) = np +2pe(n-ls)(n-L+1) +/2p- n-k)Q, ,(1/4)", 
COTES E te pene, a fae 
so Eqn. (1) becomes 
min os 
(L-1, 
2 n-1) 
V(X) = np(1-np) + p (n-L)(n-L+1) + 2p > (n-k)Q, _1 (1/4) C20 
k= 


On the right hand side of Eqn. (2) and of the previous equation, the second 
term equals zero if aly and the third term equals zero if n=1. If L=1, the 
third term in Eqn. (2) equals zero, and V(X) reduces, as it should, to the 


variance np(1-p) of a binomial random variable. 


The formulas for E(X) and V(X) cn be generalized as follows for an 
arbitrary number a of letters in the alphabet, occurring independently with 
probabilities i (which sum to 1). Suppose the subsequence S 


consists of L letters with respective probabilities p. P, cea De » #let 
1 2 L 


k 
Pe = ne P; be the product of the probabilities of the first k letters in S. 
m m 


=] U 
Ini particular, Pe = wt Py = Pr(S)s Then, E(x) = nP,» and Eqns. (1) and (2) 
m m 


=] 
may be generalized by substituting P, for p, and P| for G/a).. 


B 


The contribution of Q to V(X) in Eqn. (2) motivates the following proce- 


dure for ranking subsequences in order of their degree of overlap capability: 


Let Q be the binary number constructed from the L elements of Q in reverse 


order. Then subsequence S. with overlap capability Q, has greater overlap 


1 


capability than subsequence S_ with overlap capability Q., if Q>@.. 


2 
Thus, for example, the following subsequences are listed in increasing order 


of overlap capability: ACGTT, ACGTA, ACGAC, AACAA, ACACA, AAAAA (for which Q 
= 10000, 10001, 10010, 10011, 10101, 11111, respectively). Using this ranking 


procedure, then for fixed L and n>1, V(X) increases with the degree of over- 


lap capability. 
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Examination of Eqn. (2) shows that for n>L, the maximum value of V(X) 
(achieved when all elements of Q are equal to 1), is greater than the variance 
np(1-p) of a binomial random variable, and the minimum V(X) (achieved when all 
elements of Q except q are equal to 0) is less than np(1-p). Asn +o, 

L-1 


V(X)-np(1-p) —9 2np ST (47*-p) if G21,1,...,1, and ¥(X)-np(1-p) —> 
k-1 
Bod nps if Q=0,0,...,0,1. Thus, the difference between the maximum and 
L-1 
minimum variance approaches 2np a Gigs as N—-o. For example, these three 
Kes 


limiting quantities are equal, respectively, to .023n, -.00781n, and .0313n 
for L=2, and to .00266n, -.0000916n, and .00256n for L=4. 


4. DERIVATION OF THE PROBABILITY FUNCTION 


Using combinatorial theory, an “fecpene formula can be derived for the 
probability generating function of X. This requires an appropriate applica- 
tion, as described in the Appendix, of the combinatorial techniques in Goulden 
and Jackson (1983, Section 2.8.). From the probability generating function, 
an algebraic formula for f(x;L,M,Q), recursive in x and M, can be obtained as 


shown below. A separate formula involving parameters L and M is required for 


each Q. 


The probability generating function P(u,v) for f(x;L,M,Q) is 
ones 
PCa vane f(x3L,M,Q)u vy 
M=0 x=0 


< 1-(v-1)h(u/4 (3). 
[1-(v-1)h(u/4) )(1-u)-(u/4) (v-1) 


This formula involves the "prefix polynomial" h(x), defined as 


L-1 
h(x) = ea xanga (Note that h(p) is the sum of the L-1 transition 

=i] 
probabilities described in Section 2.) In Eqn. (3), the "4", which appears 
three times, can be generalized to be the number of equiprobable letters of 


the alphabet. This also holds for Eqns. (4), (5), and (8) below. 
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f(x3L,M,Q) is the coefficient of vp ine Con. (oe To obtain a 
formula for f, write the denominator of the right hand side of Eqn. (3) as 


1-D, where 


Sie fu/5)s + (Wainy - (1-u+uv-v)h(u/4) e (4). 


Then multiply both sides of Eqn. (3) by 1-D and isolate P(u,v) on the left 


hand side to obtain 


PF eo 
Sa SE Aeaienne = 1-(v-1)h(u/4) + ie >? Tear ome: CS 
M=0 x=0 M=0 x=0 


If D is written as 


So that’.€#->. is the coefficient of Tint in Eqn. (4), then the coefficient 


a ewe sie Eqn. (5) is 


: 


L 
F(x3L,M,Q) => > C,, f(x-jsL,Mi,Q) (6). 
f=1 j=0 74 


This is obtained by applying the following boundary conditions: 


f(0;L,M,Q) = 1 if MSL; 
f(x;L,M,Q) = O if M<L and x>0 CTs 


The following general formulas for C obtained by expanding and rear- 
ranging terms in Eqn. (4), can thus be used in Eqn. (6) to obtain a formula 


for f(x3;L,M,Q): 


For L?1, define CS Ci=lhees, be=0.1) asiafo Lows: 
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Also, 16 155.2, .then, for k=letosl-2: 
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Eqn. (6) is applicable for M=L,...,@ and x=0,...,M-L+1. If x=0, then terms 
involving the argument x-1 are equal to zero. Table 1 provides formulas for 
f(x;L,M,Q) for all possible Q's for L=2 to L=8. 


Using the subsequence ACA as an example, so that L=3 and Q=1,0,1, the 


C..'s are obtained as follows: 


Oyu 
oO 

Bite 3/47 
fae -3/4° 
Ca -1/42 
cea 1/47 


Therefore, from Eqn. (6), 
f(x3lL,M,Q) = f(x3L,M-1,Q) - f(x3lL,M-2,Q)/16 + 3f(x3L,M-3,Q)/64 
+ f(x-15b,M-2,Q)/16 - 3f(x-15L,M-3,Q) /64 
as in Table 1. 


The formulas thus derived for f(x3;L,M,Q) are recursive; f(x;L,M,Q) 
depends in general on f(x;L,M-i,Q) and f(x-13;L,M-i,Q) for i=1,...,L. (The 
proof of this follows from Eqn. (3) and from the fact that no term of the 
prefix polynomial h(u/4) can be of degree greater than L-1.) Thus, a computer 
program to calculate numeric values of f(x;l,M,Q) for x = 0 to an upper limit 
J needs to store an L by J-L array of probabilities. Alternatively, a recur- 


sive programming language such as Pascal or Algol can be used. In either 
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case, initialization is performed using the boundary conditions of Eqn. (7). 
A Fortran program to compute f(x;L,M,Q), E(X), and V(X) given L, M, and the 


subsequence is available from the authors. ; 


Eqn. (6) is valid for the binomial distribution (i.e., for L=1 and Q=1) 


Be? Co = 2% and Cy = 4, yielding the recursion formula 

f(x3L,M,Q) = 3f(x3L,M-1,Q)/4 + f(x-13L,M-1,Q)/4 (9). 
Note from Eqn. (8) that if L>1, the only case in which Ch = ~ and Ch = } 
is when Q 451) in which case Q,=Q,=---=@ _=1, so that all remaining fae = 


in Eqn. (6) are nonzero. 


Table 2 shows values of f(x;4,20,Q) for the three subsequences AAAA, 
ACAC, and ACGT, chosen to represent subsequences having "high", "medium", and 


"low" degrees of overlap capability, respectively. 
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S. EXAMPLES OF THE USE OF THE PROBABILITY FUNCTION 
IN EXACT AND APPROXIMATE SIGNIFICANCE TESTS. 

The availability of formulas for f(x;L,M,Q) makes it possible to perform 
exact significance tests of the hypothesis of randomness, and the formulas’ for 
E(X) and V(X) provide the needed quantities for an approximate test. As an ex- 
ample, an 825-nucleotide-long sequence obtained from Georgetown University 
Medical Center’s Nucleic Sequence Database and shown in Table 3 will be used. 
It is described in Dayhoff et al. (1983) as a "Middle repetitive (Alu family) 
genome fragment - human (length 825)." This intergenic material was one of 14 
sequences examined in Gentleman et al. (1984). 

The subsequence CC occurs 67 times in this genome fragment. Under the 
hypothesis, the frequency of its occurrence in a sequence of length 825 has an 
expected value of 51.50 and a variance of 67.57. Table 4 shows probabilities 
and cumulative probabilities for frequencies from 30 through 70. (The complete 
range is from 0 through 824.) Using these probabilities, the significance level 
for an exact test of the hypothesis can be calculated as the sum of the 
probabilities for frequencies 267, plus the sum of the probabilities for 
frequencies <36 (these being all frequencies with probabilities less than or 
equal to Pr(67)). Thus, the significance level = .039 + .028 = .067. 

By identifying frequencies as close as possible to the .025 and .975 points 
of this discrete distribution, a 95% confidence interval is obtained; its lower 
bound is between 35 and 36, and its upper bound is between 67 and 686. 

The usual approximate Chi-square goodness-of-fit test has sometimes been 
used to compare observed and expected subsequence frequencies. (For example, 
Smith et al. (1983) used this test with expected frequencies based on the 
overall sequence base composition.) The goodness-of-fit test sums the scaled 
squared differences between observed and expected frequencies. The two observed 
frequencies in the present example are 67 (the number of occurrences of CC) and 
757 (the number of occurrences of other subsequences of length two). The 
resulting test statistic value is 4.976, so the approximate significance level 
for this test would normally be calculated as Pr (‘X.>4.976) = .0257, which is 
considerably smaller than the exact value of .04670. However, the goodness-of- 
fit test is inappropriate here, due to differences, which remain even as no, 
between #(x3L,M,Q@) and the binomial distribution (as shown in Sections 3 and 4); 
when there are only two observed frequencies, the goodness-of-fit test statistic 
is equivalent to the square of a standardized observed binomial(n,p) frequency. 


An appropriate approximate test statistic can be obtained by standardizing 
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the observed frequency of a subsequence using the correct variance (as given in 
Edua(cv?n 1. @s, by Using 1. = (x-np)* /V(X) instead of T = (x-np)® /Enp (i-p) J. 
The central limit theorem for dependent trials can then be invoked (e.g., as in 
Feller (1950), p. 374, and in Shukla and Srivastava (1985)) and a : approxima- 
tion used for sufficiently large n. In the case of the x=47 occurrences of the 
subsequence CC, T*=3.5556, yielding a much more accurate approximate sig- 
nificance level of Pr(X >3.5556) = .0594, (Note fhat T* can be used in the 
more general case of an a-letter alphabet with probabilities that are not neces- 
sarily equal.) 

Examining a longer subsequence, the observed frequency of the subsequence 
CCCC is found to be & The expectation and variance of the exact distribution 
are 3.21 and 5.23, respectively. The exact significance level is .153, and the 
approximate significance level using Tee eee? s eaene approximate significance 
level using T would be .119.) In this case, the approximate test is not ac- 
curate; the Chi-square approximation relies on expected frequencies being of 
about size five or larger, because f(x3;L,M,Q) is then more symmetric. 

This illustrates the fact that, for fixed n, the approximate test is less 
likely to be usable for a longer subsequence than for a shorter one, since €E(X) 
decreases as L increases. Fortuitously, computation of an exact significance 
level is considerably faster (and therefore cheaper) for a larger value of L 
than for a smaller one; both lower and upper tail areas are required to cal- 
culate the P-value for a two-sided test, and when E(X) is relatively small, 
fewer values of £(x3;L,M,Q) need to be calculated recursively before reaching the 
upper tail of the distribution. 

The sequence TTTTTT occurs twice in Table 3. The expected number of occur- 
rences is .20, and the variance is .33. The significance level for the exact 
test is .042. Since the expected frequency is so small, an approximate test 
would not be used. (If it were, the resulting significance level would be 
.00181 for T¥ and .00006 for T.) 

As a final example, consider the subsequences TTGTTT and AAACAA, which oc- 
cur six and five times, respectively. These subsequences are inverse comple- 
ments of each other; each consists of the complementary nucleotides of the 
other, in reverse order. Each occurs much more often than would be expected; 
the respective exact significance levels are .12x10-° and .31x10°. Perusal of 
the locations of occurrence of these subsequences reveals that all six occur- 
rences of TIGTTT are close together (beginning in positions 773, 778, 785, 789, 


and 797), and that four of the five occurrences of AAACAA are close together (in 
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positions 355, 361, 375, and 387). The two clusters of subsequences occur 
slightly more than 400 nucleotides apart in the overall sequence. This suggests 
the possibility that the sequence has a looped superstructure, stabilized by the 


bonding together of two regions which are about 400 nucleotides apart. 


6. CONCLUDING REMARKS. 

Formulas have been provided here, and methods described for obtaining any 
others which are needed, which permit means, variances, and probabilities to be 
calculated for distributions of nucleotide subsequence frequencies. Exact sig- 
nificance tests can be performed and confidence intervals calculated, or an 
approximate test can be used, to analyze patterns of nonrandomness in nucleotide 
sequences or subsequences. This can assist scientists in learning about the 
structure and functionality of the (sub)sequences. The exact methods are espe- 
cially useful when the expected subsequence frequency and/or the sequence length 
1S so small that an approximate test is not usable. On the other hand, the 
approximate test can be used for the more general case where the letters of the 
genetic alphabet have hypothesized probabilities which are not necessarily 
equal. 

Significance levels from these tests can also be used to compare two or 
more sequences, as follows: For a given subsequence S, perform the significance 
test for each sequence and compare the P-values, thus comparing the deviation of 
the sequences from a common null hypothesis. Comparison of P-values instead of 
observed frequencies permits sequences of different lengths to be compared, thus 
avoiding problems of alignment. It also permits results for subsequences having 
different lengths or overlap capabilities to be compared, since the P-value is 
standardized according to each subsequence’s own frequency distribution. 

In analyzing patterns within a sequence, it will be natural to repeat these 
tests for numerous subsequences, in which case the analyst should bear in mind 
the usual caveats appropriate for multiple comparisons. One possible approach, 
in the spirit of Daniel (1959), would be to use a 1g: (or half-normal or normal) 
probability plot to analyze multiple values of the approximate test statistic ibe 
(or its square root, or its signed square root, respectively). 

When it is appropriate to assess the degree of departure from randomness, 
these concepts and methods can also be applied to the analysis of sequences of 
amino acids in proteins, and in fields other than molecular biology, e.g., in 
time series analysis and cryptography. 

It would be useful if future research could produce a generalization of the 


frequency distribution formulas to the case of unequal probabilities for dif- 
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ferent alphabet letters. 
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TABLE 1: Formulas for f (x3 .M,Q) 
for All Possible Qs for L22 to L=8 


For notational Breet #(x3L,M,Q) is denoted a(M,x). 
Formulas are applicable for’ nt 21} and x=0,M-L+1. 
If x=0, terms involving the argument x-1 are equal to zero. 


L=2; c 
Q=0,1 (ge S=AC ) 
Sines 2 a(M-1)x) - ya t-2yx) /16 
+ a(M- ae 
Q=1,1 (e.g., S=AA) 
alM,x) = Sa(M-1,x)/4 fi 
teSa (Ma-2eX ile 
+ a(M—1,x-1)/4 - 3a(M-2,x-1)/16 
L=3; 
Q=0,0,1 (e.g., S=ACC) 
alMyx) = a(M-1,x) - a(M-3,x) /64 
+ a(M-3,x-1) /64 
Q=1,0,1 (e.g., S=ACA) 
alMsx) = a(M-1,x) 
- a(M-2,x)/16 + 3a(M-3,x) /64 
+ a(M-2,x-1)/16 - 3a(M—-3,x-1)/64 
Q=1,1,1 ee S=AAA) 
alMjx) = 3atM-1,x)/4 
+ 3a(M-2,x)/16 + 3a(M-3,x) /64 
+ a(M-1,x-1) /4 
- 3a(M-2,x-1)/16 - 3a(M-3,x-1) /64 
L=4; 
Q=0,0,0,1 (e.g., S=ACGT) 
alMyx3 = a( eiP ay - a(M-4,x)/256 
+ a(M-4,x-1) /256 
Q=0,1,0 aed S2ACAC) 
atMix} = a(M—-d,x) - a(M-2,x)/16 + a(M-3,x)/16 
- a(M—-45x) /256 
ta (M=2, k=l) A lOemsanni—o yX-1)/160 toa (M=4,%=1)/256 
@=1,0,0,1 (e.g., S=ACGA) 
almyx = Hieleth 
- a(M-3,x)/64 + 3a(M-4,x) /256 
een yy can sata Kel) 256 
Q=1)1,1,1_ (2.9.5, S=AAaA) 
alMix} = 3atM—-1,x)/4 + 3a(M-2,x)/16 
+ 3Ba(M—-34x)/64 + 3a(M—-4,x) /256 
+ al(M=1,x-1)/4 - Sa(M-2yx-1)/16 
Sa Hesetel 76s) — ssa tea ke 11/256 
L235; 
Q=0,0,0,0,1 (en9es S=ACGTT) 
almixS = a(M-{yx) - a(M-5,x)/1024 
+ a(M-5,x-1)/1024 
Q=0)1,0,0,1 (e.g. S=ACGAC) 
a Myx = a(M-1,x) 
— a(M-35x)/64 + a(M—4,x) /64 
- a(M-5,x)/1024 
+ a(M-35x—-1)/64 = a(M—4,x-1) /64 
+ a(M-5,x-1) /1024 
Q=140,0,0)1 (e.g. S=ACGTA) 
a Mix) = a(M-1,x) 
- a(M-4,x)/256 + 3a(M-5,x)/1024 
+ a(M-45x-1)/256 - 3a(M-5,x-1)/1024 
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1 (e.g., S#ACGTACA) 
=1,x) - a(M-7,x)/16384 
-7, x-1)/16384 


x) 

/1ete4 

Bid/2a0) 0a (Monk) / 200 
-1)/16384 

S=ACGTTAC) 

Py + a(M-6,x)/1024 i. 


10240 —7a(M—6,x-1)/ 1024 
16384 


., S=ACGTACA) 
~"a(M—6,x)/4096 + 3a(M-7,x)/16384 
1/4096 ~ 3a(M-7,x-1) asda 


1 (erga, S=ACGACGA) 


ei ~ 


Cs 


+++ 11 ie 
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Ow AD & ~~ 


§ 
M1, 
Mes tx 
M-6, 
(M=? x) /16384 

M-3,x-1)/64 - a(M-4,x-1) /64 
M-4yx-1)/4096 - 3a(M-7,x-1) /16384 
», S=ACAAACA) 

256 + a(M-5,x)/256 

4096 

116384 

1/256 - a(M-5,x-1) /256 

71) 74096 - 3a(M27 ,x-1)/ 16384 


S=ACACATA) 
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+ a(M-5,x)/256 
+ Ja(M-7,x) /16384 
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b= Sait= wie -1)/256 
096 - 3a(M-7,x-1)/16384 


S*AACCCAA) 


nnn RRR RRO 
eS a SS 
Oo PUNO SUN 


b] 
1024 
/4096 + 3a(M-7,x) /16384 
1/1024 
1)/4096 - 3a(M-7,x-1)/16384 
Q=1,1,1,0 
(Myx) 
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L=8 


Q=1,1,1,1,1,1,1 (e.gs, SHAAAAAAA) 
alMix3 = Raleaientirs 
+ 3a(M-25x)/14 
+ Ja(M-3,x)/64 
+ 3a(M-45x) /256 
+ 3a(M-55,x)/1024 + 3a(M-b,x) /4096 
+ 3a(M-75x) /1438 . 
+ a(M-1,x-1)/4 
= 3a(M-2,x-1)/16 
- 3a(M-3)x-1) /64 
~ oa (MA ox 1) 256 
- 3a(M-5,x-1)/1024 - 3a(M-6,x-1) /4096 
- 3a(M-75x-1) /16384 - 
Q=0,0,0,0,0,0,0,1 (e.g., S=ACGTGGTC) 
alMyx) = a(M-1,x) - alM-8,x) /65536 
+ a(M-8,x-1) /65536" 
O=010,0,1,0,0,0,1 (e.g. , S=ACGTACGT) 
alMyx) = a(M-i,x) - alM—4,x)/256 + a(M-5,x) /256 
- S(M-8'x) /45536 
+ a(M-45x-1)/256 - a(M-5,x-1)/256 
+ a(M-8)x-1) /65536 
G=0,0,1,0,0,0,0,1 (e.g., S=ACGTTACG) 
alMyx} = a(M-1,x) - alM-5,x)/1024 + a(M-6,x)/1024 
= B(M-8'x) /65536 
+ a(M-5ix-1)/1024 - a(M-6,x-1)/1024 + a(M-8,x-1) /65536 
Q=0,1,0,0,0,0,0,1 (e.g., S=ACGTGTAC) 
RUM cs ee eee ea eb en 2096) ha (MeZ 724098 
= S(M-8'x) /65536 
+ a(M-6,x-1)74096 - a(M-7,x-1)/4096 + a(M-8,x-1) /65534 
Q=0,1,0,0,1,0,0,1 (e.g., S=ACGACGAC) 
alMix} = ai(M-d,x) - alM—3,x)/64 + a(M-4,x) /64 
— a(M-61x)/4096 + a(M-7,x) /4096 
- a(M-8,x) /65536 
+ a(M-3,x-1)/64 - a(M—4,x-1) /44 
+ a(M-6)x-1)/4096 - a(M—7,x-1) /4096 
+ a(M-8,x-1) /65536 
Q20,1,0,1,0,1,0,1 (e.g. ASZACACACAC) 
alMyx$ = a(M-1,x) - alM—2,x)/16 + a(M-3,x) /16 
- a(M-45x)/256 + oths 5,x)/256 - a(M—-6,x) /4096 
+ a(M-74x) /4096 
= a(M-8)x) /65536 
+ a(M-2,x-1)/16 - a(M-3,x-1) /16 
+ a(M-4)x-1)/256 - a(M—9,x-1)/256 + a(M-6,x-1) /4096 
- a(M-7)x-1) /4096 
+ a(M-B)x-1) /65536 
Q=1,0,0,0,0,0,0,1 (e.g., S#ACCCCCCA) 
almixd = a(m-d,x) - alM-7,x)/16384 + 3a(M-8,x) /65536 
+ a(M-7,x-1)/16384 - 3a(M-8,x-1) Jassie 
G=1,0,0,1,0,0,0)1 (e.g., SsACGAACGA) 
atMyxs = a(M-d,x) - alM—4,x)/256 + a(M-5,x) /256 
= B(M-7'x) 16384 +’ Sa(MoB,x) /65532 
$ a(M-4yx-1)/258 > a (tS yx tt) ose 
+ a(M-71x-1)/16384 
= 3a(M-8,x-1) /465536 
Q=21,0,1,0,0,0,0,1 (e.g., S#ACAGGACA) 
almixs = a(m-l\x) - aly-S,x) /1024 + a(M-6,x) /1024 
= 3 (M-7'x) /16384 + 3a(M-6, x) /65536_ 
+ a(M-5)x-1)/1024 - a(M-65x-1) 71024 
+ a(M-7)x-1)/16384 - 3a(M—B8,x-1) /65536 
Q=1,1,0,0,0,0,0,1 (e.g., S=AACCCCAA) 
alm’x} = a(m-1yx) -"alm—6,x)/4096 + 3a(M-7,x)/16304 
+ Sa (MB x) (5536 | 
+ a(M-6.x-1)/4096 - Sa(M-7,x-1) /16384 
- 3a(M-8,x-1) /65536 
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10,0)1 (@.g.4 SAAGAAGAA) 

a(M-1,x) -"alM—3,x)/64 + a(M-4,x) /64 

a(M-6 x) /4096 

Sain? 4x) /16384 + Sa (M8 x) /65536 

a(M-3,x-1)/64 - a(M-4,x-1)/6 

a(M-6,x-1) /4096 

3a (M-7,x-1)/16384 - 3a(M-8,x-1) /65536 
40,0)1 (e.g. 5, SeARACGAAA) 

a(M-1,x) - alM-5,x)/1024 

Saltcre x) 74096 >) Sali 7x) /16384 

3a (M-B,x) /65536 

a(M-5,x-1)/1024 - 3a(M-6,x-1)/4096 | 

Sa(M-?,x-1)/16384 - 3a(M2B,x-1) /68536 
1,1,1 (e.g., S=AAAAAAAA) 

Sail Mute Silk 

Sa(M-2,x)/16 

3a (M-3)x) /64 

3a (M-41x) /256 

3a(M-5.x)/1024 + 3a(M-6,x) /4096 

3a (M-7,x)/16384 + 3a(M-G,x) /65536 

a(M-1,x-1)/4 

3a(M-2,x-1)/14 

Sa(M-3,x-1) /64 

3a (M-4,x-1) /256 

3a (M-55x-1)/1024 - 3a(M-6,x-1) /4096 

3a (M-7,x-1) /16384 

3a (M-Byx-1) /65536 
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TABLE 2: Values of ae Une 20,0), E(X), and avon 
for Q = 1,1 41,0,11 and 0,0 


(e.g., for aiGenedenoes AAA, Acad.’ and ACGT) 
Peet Soe ee PROBABILITY.._—ss | 
FREQUENCY AAAA ACAC ACGT 

0 0.9499E 00 0.9383E 00 0.9350E 00 
i 0: 3772E-01 0.5723E-01 0.6366E-01 
2 O.9351E-02 0.4154E-02 0.1359E-02 
3 0.2288E-02 0.2653E-03 0.9770E-05 
4 0.5527E-03 0.1509E-04 0.1629E-07  - 
5 O.1318E-03 0.7665E-06 0.9055E-12 
é 0.3099E-04 0.3442E-07 0. 

7 0.7190E-05 0.1334E-08 0. 

8 0. 1643E-05 0.4184E-10 0. 

9 0. 3698E-06 0.9095E-12 0. 

10 0.8175E-07 0. 0. 

11 0, 1773E-07 0% 0; 

12 0. 3757E-08 OQ 0; 

13 0:7749E-09 0. 0. 

14 0.1526E-09 0. 0. 

15 O,3001E-10 0. 0. 

16 0.54S7E-11 0. 0. 

17 0.9095E-12 0. 0. 
E(x) 06641 06641 06641 
V(X) 10506 07210 06477 
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TABLE 3: Example of a Nucleotide Sequence: 
Middle Da hoe (Alu family) genome fragment -_human. 


vith 82 From Georgetown University Medical Center's 
Nucleic Acid Sequence Database. é 
POSITION NUCLEOTIDES 


1- 50 CTCGAGGGAGGAGCCCGGGGCTGGGGTACGGAGGCCTCTGCACATCTTAG 
1-100 AGTAAAACAAGCAGGAGAGGCTGGGTGCGGTGGCTCATGCCTATAATCCC 
1-150 AGCACTTTAGGAGGCTGAGGCGGGCAGATCACCTGAGGTCGGGAGTTCAA 
1-200 GACCAGCCTGACCAACAGGGAGAAACCCCATCTTTACTARAACTACAAAA 
1-250 TTAGCTGGGTGTGGTGGCACATGCCTGTAATCCCAGATATTCGGGAGGCT 
1-300 GAGGCAGGAGAATCGCTTGAACCTGGGAAGCAGAGGTTGCGCTGAGCCGA 
1-350 GATGGCACCATTGCACTCCAGCCTGGGCAACGAGAGCGAAACTCCGTCTC 
351-400 AAARAAACAAAAACAAAAAAATCAAAACAATCAAAAARACARGCAGGAGG 
401-450 GGCTCTGAGGTGCCTGCAACACCCAGGTACAATCCGTGGCCCTGAGGCCC 
4351-500 ATCACAGGGAAGGGGTCTTTGCAGCTCTTTCAACCCCCAGCCCAGCATCC 
501-550 AAGGAAGCCCAGGGCAGGGAGAAACCTCAGCTGCACCATCAGAGCTCAGA 
551-600 ACAGAGAAGGCAGAAATTAGCAGGGAGTGGGGCTGGGGAGGCTTCCTAGA 
601-650 AGACGTGTCTCCCGCCTTGCTGGCACTGAGGCCTTGAGGATGGGTCCATA 
651-700 CTGGGCCCCCACTGCCAGGGATGCAGATCCGGCCCACTGCTGAAATCTGT 
701-750 GCTCCTGGAGCCTCCCTCCTGTTCATGGGCCACAGGCTGTGAARACCCCA 
751-800 GAGTCCTCCCAGGCAGCAAGTTTTGTTTTGTTTTTTGTTTGTTTGCTTGT 
801-825 TTGTTTTTTGAGAGTCTGCTCGTCA 
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Values of f(x; 
f Q= 1,1 
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(e.g., for Subsequence CC) 
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APPENDIX 


Derivation of the Probability Generating Function P(u,v) 


OO 0 
Let P(u,v) A Sy aa A v*, where oe is the probability of x 
= x=0 
occurrences of a specified string S of length L and Sate capability Q ina 
string of length M. The approach is to develop a generating function which 
counts the x occurrences of S as a substring, and to convert the result to a 
probability generating function. The derivation here is an application of the 
material in Goulden and Jackson (1983, Section 2.8), restricted to the special 
case of one distinguished substring, and developed for an alphabet N of n 
symbols. 
Let S be a non null string of length L. A cluster of length M and index 
t is a string C with a distinguished subset of members Sky Skoreees Skt 
and a distinguished set of substrings TysToreeeTy with the following 
properties: 
(1) Each Tj is the string S; 
(2) The symbol Ski is the first member of 1Tj3 
(3) The subscript k; is 1, and ky = M-+4+1, that is, the first and last L 
symbols of C are distinguished substrings; 


(4) Any consecutive pair of substrings overlap; 


(5) Every element of C occurs in at least one Jj. 


Note that not every substring of C which is identical with S need be distin- 
guished. For example, let S = ACACAC. Then a cluster of length 12 and index 
3 is ACACACACACAC, where the distinguished substrings begin in the first, 
third, and seventh position. There is a string identical to S which begins in 
the fifth position, but this is neither distinguished nor counted in the index 


count. 
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To introduce the overlap information, the prefix polynomial is used. A 
prefix of a string S is a non-null string P such that ay exist non-null 
strings X and Y such that S = PX = YP. Let aA seas denote the set of 

ot ink 
prefix lengths in S. Then the prefix polynomial is h(x) = a x 1, Note that 
this definition coincides with that given in Seeeontan os 

Note that any cluster C of index t can be uniquely decomposed into an 
ordered set of t-1 prefixes and a copy of S, with each prefix beginning with a 
distinguished element and terminating just before the following such element. 
For the cluster given above, the decomposition is AC,ACAC,ACACAC. Conversely, 
such an ordered collection gives rise to a unique cluster of index t by rever- 
Sing the above procedure. 


Let Cut denote the number of clusters of length M and index t 
b 


relative to the string S. The cluster generating function C(u,v) for S is 


C(u,v) Ss SS Cut uM ve, 


M=-0 t=0 


defined by 


In the following, we use the fact that the generating function for an 
ordered collection of objects is given by the product of the generating func- 
tions for the objects; that is, if A and B are collections of objects with 
generating functions A and B respectively, then the collection of objects 
(a,b) where a € A and b& B is A B. For further details see Goulden and 
Jackson (1983). 

Lemma. Let S be a string of length L with prefix polynomial h(x). Then the 
cluster generating function for S is 

C(u,v) = utv/(1-vh(u)). 
Proof. As noted above, a cluster of index t can be decomposed into an ordered 
collection of t-1 prefixes and a copy of S. The generating function for {st 
is uly, and for the prefixes is vh(u), so the generating function for 


clusters of weight t is 
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(vh(u))&-! uty. 


Summing over all values of t yields 


Lo 
Main Pec atin a. Cone 
t=1 


ubv/(1-vh(u)) 


as required. 

To obtain the generating function for the number of occurrences of S as a 
substring of all strings of length M from N, it is convenient to work with 
indexed strings. An indexed string T of length M and index t (relative to a 
string of length L) is a string of length M with a distinguished subset of 
entries S42 Skoree es Sky and a distinguished set of substrings 
Ti sToyeeey Ty with the following properties: 

(1yaEach 1, 1s the string $; ; 
(2) S,, is the first entry of Tj. 

Note that unlike clusters, we do not require that the first or last 
strings of length L be copies of S, nor must every element occur in a distin- 
guished string. Also, adjacent distinguished subsets need not overlap. 

Let dy t denote the number of indexed strings of length M and index t 


relative to S. Then the index string generating function for S is 


ole 
O(u,v) a a dm t uM yt, 


M=0 t=0 
Lemma. Let S be a string of length L with cluster generating function 
C(u,v). Then the indexed string generating function for S is 
Divas (1-nu-C(u,v))7- 
where n is the number of alphabet symbols. 
As with clusters, indexed sets can be uniquely decomposed into ordered collec- 
tions; this time the entries will either be a single element from N or a 


cluster. To obtain the decomposition, work from the beginning to the end, 
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examining each character ae treating it as a single entry in the ordered col- 
lection until a distinguished element Ski is hit. This will be a beginning 
of a unique cluster, which is then used as an EOEry in the collection. The 
scan continues until the end is reached. Conversely, any ordered collection 
of single elements and clusters gives rise to a unique corresponding indexed 
sequence. Since there are n alphabet symbols, the generating function for the 
set of entries for each position in the collection is nu+C(u,v), and the 
generating function for all collections of length w is (nu+C(u,v))”. Adding 
over all w yields the result. 

Note that the introduction of symbols between clusters in the creation of 
indexed sets can introduce extra copies of S (which are not distinguished in 
the cluster). Llety there may be undistinguished copies of S within the 
clusters themselves. Let T be a string of length M from N which contains 
precisely k substrings identical to S. By considering the construction of 
indexed strings, we see that T is counted in D(x,y) precisely once for each 
subset of the k copies of S. That is, D(x,y) is an "at least" generating 
function for the set of strings counted by the number of copies of S which it 
contains, in the sense of the principle of inclusion and exclusion (see, for 
example, Goulden and Jackson (1983)). 

In generating function form, the principle of inclusion and exclusion 
states that if f(z) is the generating function for the number of objects which 
contain "at least" k properties (in the above sense), then the generating 
function g(z) for the number of objects with exactly k properties is given by 
g(z) = f(z-1). Therefore if fare 
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F(u,v) = (1-nu-C(u,v-1))7!. 

There are nM possible sequences of length M, so to obtain the 
probability generating function P(u,v), replace UaeDy seus oe an (eval In 
particular, if n= 4, then 

1-(v-1)h(u/4) 
P(u,v) Be 
[1-(v-1)h(u/4)](1-u)-(u/4)~ (v-1) 


as in Section 4. 
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